
Fix Unicode token boundaries for non-ASCII “Other” scripts#1

Open
Konf wants to merge 2 commits into darkskygit:master from Konf:unicode_fix

Conversation


@Konf Konf commented Mar 10, 2026

This PR fixes tokenization for all non-ASCII scripts that go through SegmentScript::Other, e.g. Greek, Armenian, and languages written in Cyrillic.

Bug summary

  • DefaultTextNormalizer::normalize had explicit split logic for ASCII (normalize_ascii_split), but non-ASCII Other segments were processed as a single token via normalize_span.
  • This missed token boundaries in the many languages that rely on whitespace/punctuation separation.
  • As a result, a non-ASCII Other segment (for example, a phrase containing spaces and punctuation) was normalized and indexed as one token, which made single-word search queries fail.
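The failure mode above can be illustrated with a minimal sketch. The index shape and the lowercase-only normalization here are simplified assumptions for demonstration, not the project's actual indexing code:

```rust
fn main() {
    let segment = "быстрый поиск";

    // Buggy behavior: the entire Other segment is normalized as a single token.
    let indexed_buggy: Vec<String> = vec![segment.to_lowercase()];

    // A user searching for one word of the phrase finds nothing,
    // because the index only contains the full phrase as one token.
    let query = "поиск".to_string();
    assert!(!indexed_buggy.contains(&query));

    // Expected behavior: whitespace-separated words become separate tokens,
    // so a single-word query matches.
    let indexed_fixed: Vec<String> = segment
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    assert!(indexed_fixed.contains(&query));

    println!("buggy index: {:?}", indexed_buggy);
    println!("fixed index: {:?}", indexed_fixed);
}
```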

Fix

  • This PR adds a normalize_unicode_split function that acts like normalize_ascii_split, but for Unicode strings, and wires it into SegmentScript::Other text normalization.
  • Now Other text is split into Unicode word-like spans before normalization, so terms are indexed and matched independently.
  • Four tests were also added to reproduce these issues and guard against future regressions.
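The splitting step described above can be sketched as follows. This is a minimal stand-in using only the standard library's char::is_alphanumeric; the function name unicode_word_spans and the exact boundary rules are illustrative assumptions, and the real normalize_unicode_split may use different word-boundary logic:

```rust
/// Split a string into word-like spans: maximal runs of alphanumeric
/// characters, with whitespace and punctuation treated as boundaries.
fn unicode_word_spans(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        if ch.is_alphanumeric() {
            // Remember where the current word began.
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // Non-word character ends the current span; `i` is a valid
            // char boundary, so the slice is safe.
            spans.push(&text[s..i]);
        }
    }
    // Flush a trailing word that runs to the end of the string.
    if let Some(s) = start {
        spans.push(&text[s..]);
    }
    spans
}

fn main() {
    assert_eq!(unicode_word_spans("привет, мир!"), vec!["привет", "мир"]);
    assert_eq!(unicode_word_spans("Ένα δύο"), vec!["Ένα", "δύο"]);
    println!("{:?}", unicode_word_spans("привет, мир!"));
}
```

Each returned span would then be passed through normalization individually, mirroring what normalize_ascii_split already does for ASCII text.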

